load libraries

Loading libraries

##install.packages("tidyverse")
library(magrittr)
require(FactoMineR)
library(ggplot2)
library(caret)
library(dplyr)
library(R2jags)
library(plotly)

Research Question, Does irradiation (radiation treatment) effect recurrence of or non recurrence of breast cancer

We will use Breast cancer data set (a dataset containing categorical data attributes) to try to ascertain the effect of irradiation (radiation treatment) on the recurrence or non-recurrence of breast cancer in the observed data.

Plan and collect data from a relevant source

The data was chosen from, https://archive.ics.uci.edu/ml/datasets/breast+cancer. The data contains a data file containing the data and a names file containing some meta data about the data. The names of the data attributes are given: 1. Class: no-recurrence-events, recurrence-events 2. age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99. 3. menopause: lt40, ge40, premeno. 4. tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59. 5. inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39. 6. node-caps: yes, no. 7. degree.malignant: 1, 2, 3. 8. breast: left, right. 9. breast-quad: left-up, left-low, right-up, right-low, central. 10. irradiation: yes, no.

It should be noted that all data are categorical in nature. We will read in the data, set the column names and output the first 5 rows to see the form of the data, we will also set the degree.malgnant data attribute to a factor as this is ordinal data not numeric.

breast.cancer<-read.csv("breast-cancer.data",header=FALSE,sep=",")
colnames(breast.cancer)<-c("Class","age","menopause","tumor.size","inv.nodes","node.caps",
                           "degree.malignant","breast","breast.quadrant","irradiation")


breast.cancer$degree.malignant<- breast.cancer$degree.malignant %>% as.character() %>% as.factor()

cols<-c("Class","age","menopause","tumor.size","inv.nodes","node.caps","breast","breast.quadrant","irradiation")

breast.cancer[,cols] <- lapply(breast.cancer[,cols],as.factor)


head(breast.cancer)
##                  Class   age menopause tumor.size inv.nodes node.caps
## 1 no-recurrence-events 30-39   premeno      30-34       0-2        no
## 2 no-recurrence-events 40-49   premeno      20-24       0-2        no
## 3 no-recurrence-events 40-49   premeno      20-24       0-2        no
## 4 no-recurrence-events 60-69      ge40      15-19       0-2        no
## 5 no-recurrence-events 40-49   premeno        0-4       0-2        no
## 6 no-recurrence-events 60-69      ge40      15-19       0-2        no
##   degree.malignant breast breast.quadrant irradiation
## 1                3   left        left_low          no
## 2                2  right        right_up          no
## 3                2   left        left_low          no
## 4                2  right         left_up          no
## 5                2  right       right_low          no
## 6                2   left        left_low          no

Next it is important to perform a summary of the data, to check for missing data and to look at the modal classes for each attribute type

summary(breast.cancer)
##                   Class        age       menopause     tumor.size inv.nodes  
##  no-recurrence-events:201   20-29: 1   ge40   :129   30-34  :60   0-2  :213  
##  recurrence-events   : 85   30-39:36   lt40   :  7   25-29  :54   12-14:  3  
##                             40-49:90   premeno:150   20-24  :50   15-17:  6  
##                             50-59:96                 15-19  :30   24-26:  1  
##                             60-69:57                 10-14  :28   3-5  : 36  
##                             70-79: 6                 40-44  :22   6-8  : 17  
##                                                      (Other):42   9-11 : 10  
##  node.caps degree.malignant   breast     breast.quadrant irradiation
##  ?  :  8   1: 71            left :152   ?        :  1    no :218    
##  no :222   2:130            right:134   central  : 21    yes: 68    
##  yes: 56   3: 85                        left_low :110               
##                                         left_up  : 97               
##                                         right_low: 24               
##                                         right_up : 33               
## 

From the summary it can be seen that there is some missing data from node.caps and breast quadrant data attributes, but apart from that the dataset looks to be of good quality, at least for the purpose of this exercise.

Explore data

To try to ascertain if any interesting relationships exist in the data a number of data visualization techniques were used, these included: 1. Multiple Correspondence Analysis plot. Here we run multiple correspondence on the data and visualize the results in the hope of identifying strong associations between different attributes. 2. Visualizing the data in the form of stacked bar charts (and stacked bar percent) to look at the relationships/associations between variables. 3. Using contingency tables to look at the proportions between different attributes usign dplyr and table.

Multiple Correspondence Analysis plot

cats = apply(breast.cancer, 2, function(x) nlevels(as.factor(x)))
cats
##            Class              age        menopause       tumor.size 
##                2                6                3               11 
##        inv.nodes        node.caps degree.malignant           breast 
##                7                3                3                2 
##  breast.quadrant      irradiation 
##                6                2
mca1 = MCA(breast.cancer, graph = FALSE)
mca1_vars_df = data.frame(mca1$var$coord, Variable = rep(names(cats),
                                                         cats))
mca1_obs_df = data.frame(mca1$ind$coord)
# plot of variable categories
ggplot(data = mca1_vars_df, aes(x = Dim.1, y = Dim.2, label = rownames(mca1_vars_df))) +
  geom_hline(yintercept = 0, colour = "gray70") + geom_vline(xintercept = 0,
                                                             colour = "gray70") + geom_text(aes(colour = Variable)) + ggtitle("MCA plot of breast cancer variables")

From the plot we can see some potentially interesting associations: 1. There appears to be a strong association between irradiation and recurrence events as opposed to a strong association with no irradiation and no-recurrence events. 2. There also appears to be a strong relationship between node-cap-no and Class.

This is a little clearer if we create an interactive plot, which allows us to interactively explore the data:

##            Class              age        menopause       tumor.size 
##                2                6                3               11 
##        inv.nodes        node.caps degree.malignant           breast 
##                7                3                3                2 
##  breast.quadrant      irradiation 
##                6                2

Visualizing the data in the form of stacked bar charts (and stacked bar percent) to look at the relationships/associations between variables.

It iss clear from the bar chart there is a stronger association between irradiation and recurrence events as opposed to no irradiation and no recurrence events. The same data is shown below in a standard stacked bar chart form.

Contingency tables to look at the proportions between different attributes usign dplyr and table.

The above relationships can also be viewed as contingency tables using either (a) dplyr or the (b) table function. In both cases a strong association is seen between irradiation and recurrence-events

breast.cancer %>% group_by(irradiation,Class) %>% tally()
## # A tibble: 4 x 3
## # Groups:   irradiation [2]
##   irradiation Class                    n
##   <fct>       <fct>                <int>
## 1 no          no-recurrence-events   164
## 2 no          recurrence-events       54
## 3 yes         no-recurrence-events    37
## 4 yes         recurrence-events       31

or as a contingency table

table(breast.cancer$irradiation,breast.cancer$Class)
##      
##       no-recurrence-events recurrence-events
##   no                   164                54
##   yes                   37                31

Postulating a model

We will build a model which aims to show the proportions could not have come equal proportions (i.e there is no association between them) as opposed to the proportions being unequal and a strong association exists between the groups (irradiation and recurrence events). This is equivalent to a bayesian re-interpretation of the proportions test.

Describe how the model is well suited to answer your question.

The likelihood model indicates that the observed counts are modeled by a binomial distribution with a probability of p (fraction for each group/total samples) from n trials (items= tot_samples).

Identify how inference for parameters in the model will provide evidence relating to your question.

The prior on each p is defined as a beta distribution with shape parameters a and b The hyperpriors for each a and b are drawn from imprecise (non informative) gamma distributions. As input, JAGS is supplied with: 1. The observed data (obs) 2. The total number of observed items (n), total samples=164+54+37+31 (see contingency table above) 3. The number of classification groups (nGroups)

Write the full hierarchical specification of the model.

The model specification is shown below:

modelString1="
model {
   #Likelihood
  for (i in 1:nGroups) {
    obs[i] ~ dbin(p[i],n[i])
    p[i] ~ dbeta(a[i],b[i])
    a[i] ~ dgamma(1,0.01)
    b[i] ~ dgamma(1,0.01)
  }
}
"

Justify your choice of prior distributions.

The beta distribution is used as this is the conjugate prior to the beta distribution, what this means is that if the likelihood function is binomial and the prior distribution, then the posterior distribution is also a beta distribution.

Fit the model

The model is fit below, as stated in our model specification: 1. The observed data (obs), the observed frequencies of the different groups supplied from the contingency table, (see above). 2. The total number of observed items (n), total samples=164+54+37+31 (see contingency table above) 3. The number of classification groups (nGroups)

tot_samples=164+54+37+31
params<-c("p")

obs <- c(54, 164,37,31)
data.list <- list(obs = obs, n = c(tot_samples, tot_samples,tot_samples,tot_samples), nGroups = 4)
mod1 = jags.model(textConnection(modelString1), data=data.list, inits=NULL, n.chains=3)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 4
##    Unobserved stochastic nodes: 12
##    Total graph size: 23
## 
## Initializing model
update(mod1, 5e3) ## burn in 5000
mod_sim1 = coda.samples(model=mod1,
                        variable.names=params,
                        n.iter=15e3)
mod_csim1 = as.mcmc(do.call(rbind, mod_sim1)) # combined chains

Check the model

A series of checks are run these include: 1. Checking the trace plots of the parameters p[1] to p[4]. 2. Running the Gelman-Rubin diagnostic. 3. Checking the autocorrelation diagnostic. 4. Checking the effective size of the model.

Checking the trace plots of the parameters p[1] to p[4].

No long term trends are observed in the trace plots. These trace plots look sufficient.

plot(mod_sim1)

Running the gelman diagnostic.

The Gelman–Rubin convergence diagnostic provides a numerical convergence summary based on multiple chains. The potential scale reudction factor is 1 suggesting convergence.

gelman.diag(mod_sim1)
## Potential scale reduction factors:
## 
##      Point est. Upper C.I.
## p[1]          1          1
## p[2]          1          1
## p[3]          1          1
## p[4]          1          1
## 
## Multivariate psrf
## 
## 1

Checking the autocorrelation diagnostic.

autocorr.diag(mod_sim1)
##                 p[1]         p[2]        p[3]          p[4]
## Lag 0   1.0000000000  1.000000000 1.000000000  1.0000000000
## Lag 1   0.5117836407  0.506058996 0.525019580  0.5119415893
## Lag 5   0.0563745964  0.051586563 0.064095842  0.0648092712
## Lag 10  0.0055051317  0.014828944 0.008596285  0.0014939128
## Lag 50 -0.0007503344 -0.006957113 0.004299756 -0.0002990439

We see very high autocorrelation with initial values (first two terms) for p[1],p[3] and p[4]. This should prompt us to check the effective size.

checking the effective size

The effective sample sizes show they are close to the chain lengths indicating that they mixed very well

effectiveSize(mod_sim1)
##     p[1]     p[2]     p[3]     p[4] 
## 13490.51 13635.09 12918.61 13189.83

##Summary information

summary(mod_sim1)
## 
## Iterations = 6001:21000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 15000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##        Mean      SD  Naive SE Time-series SE
## p[1] 0.1910 0.02298 1.083e-04      0.0001984
## p[2] 0.5730 0.02884 1.359e-04      0.0002472
## p[3] 0.1322 0.01991 9.387e-05      0.0001753
## p[4] 0.1108 0.01838 8.665e-05      0.0001602
## 
## 2. Quantiles for each variable:
## 
##         2.5%     25%    50%    75%  97.5%
## p[1] 0.14802 0.17498 0.1905 0.2064 0.2375
## p[2] 0.51676 0.55348 0.5730 0.5926 0.6298
## p[3] 0.09598 0.11825 0.1313 0.1452 0.1737
## p[4] 0.07757 0.09789 0.1098 0.1226 0.1495

Iterate if necessary

Is the model adequate?

The model diagnostic’s appear to show the model is adequate, however for completion we will retrain the model specifying a less vague conjucate prior beta distribution.

Fitting the alternative

We fit an alternative where gamma the distribution is created with shape and scale parameter, 9 and 0.5

modelString1A="
model {
   #Likelihood
  for (i in 1:nGroups) {
    obs[i] ~ dbin(p[i],n[i])
    p[i] ~ dbeta(a[i],b[i])
    a[i] ~ dgamma(9,0.5)
    b[i] ~ dgamma(9,0.5)
  }
}
"

tot_samples=164+54+37+31
params<-c("p")

obs <- c(54, 164,37,31)
data.list <- list(obs = obs, n = c(tot_samples, tot_samples,tot_samples,tot_samples), nGroups = 4)
mod1a = jags.model(textConnection(modelString1A), data=data.list, inits=NULL, n.chains=3)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 4
##    Unobserved stochastic nodes: 12
##    Total graph size: 23
## 
## Initializing model
update(mod1a, 5e3) ## burn in 5000
mod_sim1a = coda.samples(model=mod1a,
                        variable.names=params,
                        n.iter=15e3)
mod_csim1a = as.mcmc(do.call(rbind, mod_sim1a)) # combined chains

Next test the diagnostics for the model

plot(mod_sim1a)

gelman.diag(mod_sim1a)
## Potential scale reduction factors:
## 
##      Point est. Upper C.I.
## p[1]          1          1
## p[2]          1          1
## p[3]          1          1
## p[4]          1          1
## 
## Multivariate psrf
## 
## 1
autocorr.diag(mod_sim1a)
##                p[1]         p[2]         p[3]         p[4]
## Lag 0   1.000000000  1.000000000  1.000000000  1.000000000
## Lag 1   0.290110280  0.297757663  0.307954406  0.307492168
## Lag 5   0.004146988  0.004707418  0.014287900  0.008226029
## Lag 10 -0.004659602  0.003063428 -0.002091526  0.002490950
## Lag 50  0.003840506 -0.001602570 -0.015067653 -0.005214239
effectiveSize(mod_sim1a)
##     p[1]     p[2]     p[3]     p[4] 
## 24555.14 23786.48 23319.39 22984.71

Note the effective size is larger and the lag correlations are not as high for p[1] to p[4]

The summary information is shown below, it is approximately the same as the fist iteration

summary(mod_sim1a)
## 
## Iterations = 6001:21000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 15000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##        Mean      SD  Naive SE Time-series SE
## p[1] 0.2003 0.02329 1.098e-04      0.0001487
## p[2] 0.5706 0.02881 1.358e-04      0.0001868
## p[3] 0.1422 0.02045 9.640e-05      0.0001340
## p[4] 0.1217 0.01927 9.086e-05      0.0001272
## 
## 2. Quantiles for each variable:
## 
##         2.5%    25%    50%    75%  97.5%
## p[1] 0.15643 0.1841 0.1996 0.2158 0.2476
## p[2] 0.51398 0.5512 0.5708 0.5901 0.6272
## p[3] 0.10472 0.1280 0.1414 0.1555 0.1846
## p[4] 0.08651 0.1081 0.1209 0.1342 0.1620

Using the model

Provide relevant posterior summaries.

The summary information is listed below (detailing the posterior distributions), we can see that collectively, the fractions of 1/4, 1/4,1/4,1/4 (0.25,0.25,0.25,0.25) do not fall within these ranges.

summary(mod_sim1a)
## 
## Iterations = 6001:21000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 15000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##        Mean      SD  Naive SE Time-series SE
## p[1] 0.2003 0.02329 1.098e-04      0.0001487
## p[2] 0.5706 0.02881 1.358e-04      0.0001868
## p[3] 0.1422 0.02045 9.640e-05      0.0001340
## p[4] 0.1217 0.01927 9.086e-05      0.0001272
## 
## 2. Quantiles for each variable:
## 
##         2.5%    25%    50%    75%  97.5%
## p[1] 0.15643 0.1841 0.1996 0.2158 0.2476
## p[2] 0.51398 0.5512 0.5708 0.5901 0.6272
## p[3] 0.10472 0.1280 0.1414 0.1555 0.1846
## p[4] 0.08651 0.1081 0.1209 0.1342 0.1620

Interpret the model results in the context of the problem and come to a conclusion

There appears to be a strong association between irradiation and recurrence of cancer, and non irradiation and non recurence of cancer.